Efficient Parallel Statistical Model Checking of Biochemical 

Networks 



P. Ballarini M. Forlin T. Mazza D. Prandi 

The Microsoft Research - University of Trento Centre for Computational and Systems Biology 
{ballarini , forlin, mazza, prandi}@cosbi . eii 

We consider the problem of verifying stochastic models of biochemical networks against behavioral 
properties expressed in temporal logic terms. Exact probabilistic verification approaches such as, for 
example, CSL/PCTL model checking, are undermined by a huge computational demand which rule 
them out for most real case studies. Less demanding approaches, such as statistical model check- 
ing, estimate the likelihood that a property is satisfied by sampling executions out of the stochastic 
model. We propose a methodology for efficiently estimating the likelihood that a LTL property (j) 
holds of a stochastic model of a biochemical network. As with other statistical verification tech- 
niques, the methodology we propose uses a stochastic simulation algorithm for generating execution 
samples, however there are three key aspects that improve the efficiency: first, the sample generation 
is driven by on-the-fly verification of (j) which results in optimal overall simulation time. Second, 
the confidence interval estimation for the probability of to hold is based on an efficient variant of 
the Wilson method which ensures a faster convergence. Third, the whole methodology is designed 
according to a parallel fashion and a prototype software tool has been implemented that performs the 
sampling/verification process in parallel over an HPC architecture. 

1 Introduction 

Systems biology |fT9l is concerned with developing detailed models of complex biological networks, 
which then need to be validated and analysed. A model of a biological system essentially describes the 
dynamics of a population of n interacting biochemical species S\,S2, ■ ■ -S n . Analysing the behaviour of 
such models entails looking for the occurrence of biologically relevant events during the evolution of the 
system, a problem whose complexity is exponentially proportional to the population of the considered 
model. 

In this paper we consider discrete stochastic modelling of biological systems. In the discrete- 
stochastic setting, biochemical species are enumerable quantities representing the number of molecules 
of a given substance, and the evolution of the system is probabilistic, rather than deterministic, leading 
to Continuous Time Markov Chain (CTMC) models. 

Classical transient-state and steady-state analysis [30] allows to understand important features of a 
CTMC model. Unfortunately numerical solution of Markov models is undermined by storage require- 
ments which explode with the dimension of the model. Although techniques for the efficient storing 
of CTMC's matrix/vector have been widely studied (see for example ll23l|29l ), solving CTMC models 
remains unfeasible, in most realistic case studies. 

In recent times, query based verification methods (i.e. model checking [10]) proved to be valuable 
instruments for a more expressive analysis of stochastic biological models ll2Tl l4ll5l. Biologically rel- 
evant events can be formally characterised as temporal logic formulae that can then be automatically 
checked against a discrete state model. Exact probabilistic PCTL lfl6l and stochastic CSL Em model 
checking suffer, as well, of the state-space explosion problem which limits their accessibility specially 
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in systems biology where very large models are just common. Statistical model checking |33l l9l[T2l has 
been proposed as an alternative to exact probabilistic model checking that allows for getting an estimate 
of the likelihood of a condition to hold of a CTMC model. 

The statistical model checking paradigm essentially is a combination of efficient stochastic simula- 
tion with model checking procedure. It comprises three ingredients: (i) a stochastic engine that generates 
trajectories of the underlying state space; (ii) a model checking algorithm capable of analyzing a single 
trajectory; {Hi) a statistical support for estimating the accuracy of the answer. The advantage of statistical 
model checking is that, contrary to exact model checking, it does not need to build (i.e. store) the state 
space of the model as it only explores a limited number of (finite) trajectories. The cost paid for such 
space saving is in terms of precision of the calculated measure. 

We propose a methodology that given a CTMC model M, a property and a desired level of confi- 
dence estimates the probability of </> to be satisfied by M with the estimated measure meeting the desired 
confidence. The method we propose is based on three key aspects. Firstly the trajectory generation is 
controlled by on-the-fly verification of the considered formula which means that simulation halts as soon 
as a state which verifies (falsifies) the formula is reached. Secondly, we use an efficient statistical method 
(i.e. a variant of the Wilson score interval method) which results in smaller samples (i.e fewer simulation 
runs) in order to meet the desired confidence. Finally the whole simulation/verification framework has 
been designed and tested on a parallel prototype, which is based on independent simulation/verification 
engines generation, and a MPI client/server parallel computation architecture. 

The reminder of the paper is organised as follows: in Section|2]we introduce the basics about CTMC 
modelling and stochastic simulation algorithms. In Section|3]we describe the core of our statistical model 
checking procedure; in Section [4] the architecture of the prototype which exploits the methodology is 
described. In Section[5]we provide evidence of application of statistical verification of relevant properties 
to an abstract model of the budding yeast cell cycle. Concluding remarks are given in the final section of 
the paper. 

2 Stochastic Simulation of Biochemical Networks 

The time evolution of a well stirred biochemical reacting system can be described as a Continuous-time 
Markov Chain (CTMC) whose states are vectors X(t) of discrete random variables X{(t), that give the 
amount of a molecule i at time t . The associated joint probability distribution ^(X,t), often called Chem- 
ical Master Equation (CME), gives a precise description of a biochemical network lfT31 . Unfortunately, 
as soon as the system becomes non-trivial, it is nearly impossible to solve CME (i.e. the corresponding 
CTMC), neither analytically nor numerically. The Stochastic Simulation Algorithm [14] is a computer 
program that takes a biological reacting system and produces a trace in the space of the CME. In the 
following we first introduce the basics about CTMCs and then we describe how the stochastic simulation 
algorithm works. 

2.1 Continuous Time Markov Chains 

A Continuous-time Markov Chain (CTMC) for biological modelling is a tuple M = (S,so, Q,Var,Eval) 
where S is a finite set of states, so is the initial state, Q is the infinitesimal generator matrix of the chain, 
Var is the set of state-variables and Eval : S x Var — > R>o is the evaluation function that, given the 
variable x and state s, returns the value of x in s. 

A path of a CTMC M is a possibly infinite sequence of 5 x 1 >0 pairs a = (so,to),(s\,ti) . . . (s n ,t n ) . . . 
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describing a trajectory of M, also denoted: 

SO ► ^1 ► S2 ■ ■ ■ > S n ► . . . 

where V/ G N Si G S and <2(j/, > and G M>o- A point of a indicates that at the i-th step of 
the execution a the system was in state s; and that it stayed in for t(. We adopt the following notation: 
o[i] denotes the z'-th state of a; a' denotes the i-th suffix of a; 5 (a,/) denotes the time spent in the z'-th 
state of a whereas o@t denotes which state a is in at time t. 

The state- variables War of a CTMC biochemical model will coincide with {Si,. . . the set of 
biochemical species. 

2.2 Stochastic Simulation Algorithm 

The Stochastic Simulation Algorithm (SSA) supposes a well-stirred set {Si,...,S/v} of biochemical 
species reacting through M > 1 reaction channels (reactions for short) {R\ ,Rm}- A reaction channel 
Rj is characterized by: 

• the probability aj(x)dt that, given X(t) = x, one reaction Rj will occur in the next infinitesimal 
interval [t,t + dt); 

• the change v yi of the number of molecules of the specie Si produced or consumed by a reaction Rj. 

The physical and mathematical explanation of aj{x)dt is described in IPT51 . Basically, aj(x) is a function 
of the number of possible active instances of reaction Rj. Consider, for example, a reaction 7?i : Si +S2 — ► 
S3, then a\(x) = c\X\Xi, where x\X2 is the number of active R\ in the current state x, and ci is a constant 
that depends on the physical characteristics of Si and S2. 

Given aj(x), the evolution of a biochemical network is described by the next reaction density function 
p{z,j\x,t), i.e. the probability, given X(t) = x, that the next reaction in the system will occur in the 
infinitesimal time interval [t + % y t + T +dt) and will be on channel Rj: 

p(xj\x,t)=aj(x)e- a ^ (1) 

where ao = Ylf=\Cij{x)- By conditional probability, Eq. |T] becomes p[r,j\x,t) = pi(z\x,t)p2{j\'C,x,t) 
where 

Pi(T\x,t) = a (x)e- aoi * )T (t > 0) (2a) 

P2 (jT,x,t) = (2b) 

meaning that, T is a sample from an exponential random variable with rate oq(x) , and the selected reaction 
;' is independently taken from a discrete random variable with values in [1,M] and probabilities ^44- 
Standard Monte Carlo methods are then used to select time consumed and next reaction according to 



Eq. 2a and Eq. 2b to produce a trajectory in the discrete state-space of the CME. Several variants of the 
SSA exist [22] that differ in how the next reaction is selected and in the data structures used, but all of 
them are based on the following common template: 
1 Compute ao 



2 Randomly select a reaction j according to Equation 2b 

3 Randomly select a time duration X according to Equation [2a 

4 Update state vector as x <— x + vj and total time as t <— t + % 

5 Go to step 1 or Terminate 
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3 On-the-fly statistical BLTLc Model Checking 

Statistical verification of a CTMC model M is based on the simple principle of collecting N sample re- 
alisations Oi (i G {1,. . . ,N}) of M and verifying each of them against a given property (p. The estimate 
of the likelihood of to hold true of M is obtained as the frequency p§ = ^? of positive outcomes (po) 
of the verification of </> versus a,. In the following we introduce the temporal logic we refer to as the lan- 
guage for stating properties of simulated trajectories, namely the BLTLc logic. We then provide details 
of a procedure for on-the-fly simulation- verification of a CTMC model which combines the random gen- 
eration of a trajectory, by means of stochastic simulation (see Section [2]), with a verification procedure 
based on the BLTLc semantics. Finally we describe a statistical procedure for the efficient estimation of 
the likelihood of a formula. This procedure iteratively works out the number of simulation runs needed 
in order to meet the desired level of confidence. 

3.1 Bounded Linear-time Temporal Logic with numerical constraints 

We introduce the Bounded Linear-time Temporal Logic with numerical constraints (BLTLc) a logic for 
stating properties referred to timed-trajectories resulting from stochastic simulation of a CTMC model. 
BLTLc combines the Constraint-LTL OEl (LTLc or LTL(R) ) for arithmetic-constrained LTL formulae 
with the Bounded-LTL (BLTL) [9] for time-bounded LTL formulae. Both LTLc and BLTL are based 
on classical LTL l28l temporal operators. However while LTLc allows for using (complex) arithmetical 
conditions between state variables, it does not allow for expressing time bounded conditions. On the other 
hand with BLTL time bounded LTL expressions can be formed but based on simple non-arithmetical 
conditions rather than on a grammar for arithmetic expressions as it is the case with LTLc. The syntax 
of the BLTLc logic is given by the following grammar: 

(j) ■= V al < val | -.0 | V (j) | (j) A (j) \ X 1 (j) | U 1 ^ 

vol := x | val ~ val \ func \ Int \ Real 

where / C M>o, <!G {<,<,>,>, =,7^}, ~G {+,—,*,/} and func denotes common functions such as 
pow(),sqrt(), . . .. Note that unbounded formulae correspond to / = [0,+°°) (for simplicity [0,+°°) is 
usually omitted for unbounded formulae). The operators -1, V and A are the standard boolean logic 
connectives not, or and and respectively, whereas X 1 and U 1 denote the temporal connectives, next and 
until, respectively. The next operator refers to the notion of true in the next state within time / whereas 
the until operator (0 U ! y) indicates that a future state where its second argument (y) holds is reached 
within time / and while its first argument (0) continuously holds. The two popular Finally operator 
(F 7 (0), which refers to the notion of a condition holding true at some point in the future and within time 
/), and Globally operator (G 7 (</>), which refers to the notion of a condition to continuously holding true 
within time /), are also supported by BLTLc relying on the well-know equivalences F 7 (</>) = [tt U 1 ^] 
andG / (0) = ^ J F / (^</)). 

BLTLc formulae are evaluated against timed-paths resulting from simulation of a CTMC model. The 
formal semantics of BLTLc formulae, expressed in terms of the |= relation, is given below, where a is a 
timed-path of a CTMC model. 

• a \= (val' < val") if and only if Eval(a[0},val') < Eval(o[0],val") 

• <7 |= -i0 if and only if a ^= 
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// The procedure returns a ( Result , Trace ) pair, where Result is the boolean 
// result of verification A Trace is the simulation trace generated. 

(Result, Trace) simulate .verify ( i , OW/ , M ,0, t nwx ) 
( s , t ) = Obuff [ i ] ; 
case of 

propositional formula: return (Eval(y/, M, s), Obuff)\ 
-i y. (r, tr) = simulate_verify ( i , <j buff , M, 0, t max ); 

return ( not r , tr ) ; 
ift A 1/2 : ( r ' tr) = s imul ate .verify ( i , Obuff > M, 0, t max ); 

(r' , tr' ) = s imul ate .verify ( i , tr , M, , t max ); 
return ( r A r' , tr' ) ; 
X 1 y/: if (t £ /) then return (false , (J/,,,//); 
if (i == max .index ( (Jw/ ) ) 

{ ( s next , tnext ) = Next ( ( s , t ) , M) ; 

&buff + — ( Snext 7 hiext ) j } 

return simulate.verify ( i + 1 , ow/ , M, 0, t max ); 
WiU 1 y/ 2 : if (t g /) return (false, o buf f); 

(r, tr) = simulate_verify(i, Obuff < M, l/^ , tn^); 

if ( 1 a s t ( tr ) . time > t max ) return (false, tr); 

if ( r ) then return (true, tr); 

(r' , tr' ) = s i mul a t e _v eri f y ( i , tr , M, t inax ); 

if ( -i r' ) return (false , tr' ) ; 

if (i == max.index ( tr' ) ) 

{ ( Snea > t next ) = Next((s,t), M) ; 

Obuff + — ( Snext 7 tnext ) > } 

return simulate_verify ( i + 1 , tr' , M, 0, t max ); 



Figure 1: On-the-fly verification of simulated trajectory 

• a \= 0' V 0" if and only if a |= 0' or a |= 0" 

• a \= 0' A 0" if and only if a |= f and a |= 0" 

• a \=X'^> if and only if a 1 |= and <5(a,0) G / 

• a |= 0' £/ 7 0" if and only if 3/ G N : o> \= 0" and £ y<i 5(ct, j) G / and Vj < i, a* \= 0' 
£Va/(a[/],vaZ) is the evaluation function that assigns a numerical value to the expression vol by looking 



up at the value that each variable x G Var{val) has in state a[i] of path a (see Section 2. 1 1. 

As an example of the expressiveness of the BLTLc logic consider the following formula = [(X\ < 
Sqrt(X2)) U (X2 > IO + X3)] which states that the concentration of X\ shall be less than the square root 
of that of X2 until that of Xi exceeds X3 by at least 10. 

Finally we stress that differently from the original BLTL ||9), the association of probabilistic bounds 
to BLTLc formulae is not supported. This is because the aim of the statistical verification procedure we 
introduce is to estimate the actual measure of probability of a BLTLc formula rather than to estimate 
whether such measure is below/above a certain threshold. 
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3.2 On-the-fly verification of simulated trajectory 

We define a procedure that given a CTMC model M and a BLTLc formula <p allows for stochastically 
generating trajectories of M while verifying whether the generated trajectories satisfy or falsify the con- 
sidered formula <p. The verification of <p is performed on-the-fly meaning that simulation proceeds with 
the generation of the next state only if <p is neither satisfied nor falsified in the current one (and if the 
simulation time limit has not been reached). The pseudocode for the verification procedure is outlined 
in Figure [T] Function simulate. verify () takes few input parameters: a BLTLc formula </>, a trace Gbuff 
(containing the pre-viewed trace resulting from recursive calls corresponding to the verification of sub- 
formulae) the current position i in the buffered trace, a CTMC M and a maximum simulation time t max . 
The algorithm works in the following manner: propositional formulae are verified trivially relying on 
the function Eval(<p ,M,s), that gives the value of a non temporal formula in a state s. On the other 
hand temporal formulae (may) require the generation of a simulation-trace, obtained through a call to the 
the stochastic engine function Next((s,t),M). Verification of temporal formulae may result in passing of 
the already-generated trace (i.e. the buffered trace Gbuff that results from verification of a sub-formula) 
from the inner-most sub-formulae to the outer-most ones. This is required for two reasons: first to avoid 
possible branching during the generation of a single trace (if verification requires to roll-back to a previ- 
ously generated state then re-application of the SSA may result in the generation of a different successor 
state, else in branching, which is wrong in the LTL context), secondly for efficiency (the already gener- 
ated traces should be re-used whenever feasible). For these reasons a call to the simulate verify () returns 
a pair (r,tr) where r is the boolean result of the verification of the considered formula </> and tr is the 
simulation trace generated up until verification of (p. The buffer trace Gbuff in the initial call will con- 
tain a single element Gbuff = {{soito)}, representing the initial state and time of simulation. Finally, for 
time- unbounded formulae, we adopt the following pessimistic approach: if the simulation max time t max 
is reached and the formula is neither verified nor falsified then the algorithm returns false. Thus the exact 
probability of time-unbounded formulae is an upper bound of the estimated one. 



3.3 Estimating the probability of a property 

Checking of a BLTLc property on a simulated trace corresponds to a Bernoulli experiment, where the 
outcome can be either positive or negative. Thus the number of successes that results from reiterated 
checking of the same property on n independent simulations, represents a random variable X with a Bi- 
nomial distribution. The point estimation of the unknown probability of success p out of n independent 
trials, is given by the well known maximum likelihood estimator p = po/n, where po represents the num- 
ber of successes. Clearly the reliability of such estimate is highly affected by the number of simulations 
performed, i.e. by the sample size n. As a consequence the point estimate p is usually associated with 
a confidence interval, expressed in terms of a real value a G (0, 1), which represents the range within 
which the actual value of the unknown parameter 6 (i.e. the actual probability of the considered formula 
to hold against the simulated model) shall fall (1 — a)% time^] 

The standard approach to compute the confidence interval for the probability of success of a binomial 
distribution uses the normal approximation, producing the so-called Wald interval. The hypothesis- 
testing-based statistical model checking approach of Younes et al. ll33l . for example, uses the Wald 
interval method for validating the hypothesis that the probability of a certain CSL formula is below 

'There exists a strong connection between confidence intervals and hypothesis testing: all the values Qq for the unknown 
parameter external to a 1 — a confidence interval would end in the rejection of the two sided hypothesis testing (i.e. Null 
Hypothesis Hq . 6 = 6*o) at the a level. 
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(above) a given threshold. 

Brown et al. [21, 0, have studied the coverage characteristics of different types of binomial pro- 
portion confidence intervals, and they showed that the Wald interval present unstable coverage char- 
acteristics also for large n, suggesting thus the use of other types of confidence intervals. Among 
the interval discussed, the Wilson score interval [32] have shown good coverage characteristics also 
for small n and extreme probabilities. Wilson score confidence interval is calculated by means of 
WilsonJnterval(p,n,a) = [L,U] with, 

% U] = , V (3) 

1 n Z l-a/2 

where p is the estimated probability from the statistical sample, a is the confidence level, Zi- a /2 i s 
the 1 — a/2 percentile of a standard normal distribution, and n is the sample size. As the confidence 
interval is a function of n we may reverse the problem and ask which is the proper sample size to obtain 
a confidence interval of a given width at a specific confidence level a. This is extremely useful for 
establishing how long a simulative experiments has to be (i.e. how many runs are needed) in order to 
meet a desired reliability for the measure we estimate. 



3.3.1 An iterative algorithm for sample size determination 

The sample size required for the Wilson interval of width 2s at 1 — a confidence level can be obtained 
simply by solving for n the Wilson score limits in equation ([3]) |[27l : formally this is given by function 
Wilson^sample(p,e,a) = Af with, 

2 P(l-P)~ 2e 2 + y/p(l - pf + 4 £ 2(p - 0,5)2 

N > z l-a/2 ^2 ( 4 ) 

where p is the frequency of positive outcomes of a re-iterated Bernoulli experiment. As we usually do 
not have a guess of the probability p to be used in ([4]), the standard approach is to take a conservative 
estimate, by considering p = 0.5 which is the estimate with maximum variance, and, as such, produces 
the highest sample size. 

The method we present here consists in adopting a different approach in determining the sample size. 
By iterating Q with successive estimates of p we are able to drastically reduce the number of samples re- 
quired when the true probability p is far from 0.5. More specifically, given a confidence interval width 2s 
and a confidence level 1 — a, the algorithm starts by calculating the sample size required for an initial es- 
timate p = 1 (or equivalently p = 0) and returns the minimum number Af of simulations to be performed. 
After computing the proportion of successes the new estimate p is rounded by adding or subtracting the 
quantity £ if p < 0.5 or p > 0.5 respectively. The rounded estimate p' is then used to recalculate the sam- 
ple size resulting in N'. If we have already performed a cumulative N to t > N' simulations the algorithm 
stops. Conversely, we iterate the process again by launching N' — N tot simulations. 

The p rounding step is crucial and ensures that a successive sample size calculation would avoid 
undersized samples due to erratic estimates. If the current estimated probability of success drifts from 
the true unknown one, towards extreme probabilities, of more than e, this would produce an undersized 
sample which would produce a confidence interval not covering the parameter at 1 — a level. 
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N tot = 0; 

N= Wilson_Sample ( 1 , e, a); 
Perform N experiments ; 

N M = N, ot + N; p = ^ ; 
if p < 0.5 

p' = fi + e; 
else p' = p - e ; 

N' = Wilson_sample (p\ £, a); N new = N' - N tot ; 
if AW > 

N = AW-; goto 3; 
else return (/?, Wilson.interval , A^ of , a)); 



Figure 2: Iterative Wilson method for sample size determination 



3.3.2 Performance of the iterative sample size determination 

The iterative method for the determination of sample size for the Wilson score interval drastically reduces 
the number of required samples with respect to the conservative approach that starts with p = 0.5. Of 
course this gain is greater when the actual p is close to extreme values p = and p = 1 , while using the 
same sample size for p close to 0.5. 




Figure 3: Average sample size for CI-width= 0.05 at 99% confidence lelvel. Comparison of sample size 
required with i) conservative approach, ii) Iterative Wilson, iii) Minimum sample size with known p 

Figure[3]represents the sample size required to obtain a confidence interval of a given width 2s = 0.05 
at a confidence level 1 — a = 0.99 for values of p ranging from to 1 by 0.1. Bars from light gray to 
dark gray represent respectively: I. Conservative approach (using always />=0.5); II. Iterative Wilson; 
III. Minimum sample size if the unknown p would be known (i.e. p = p). On the right side of the plot we 
reported the percentage of reduction in sample size by using the Iterative Wilson approach with respect to 
the Conservative approach. As it can be easily seen, we have the strongest reduction in sample size when 
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the true p is close to or 1 . By using the Iterative Wilson method we presented here, we can reduce the 
sample size required by up to 88% with respect to the Conservative approach. This reduction at extreme 
values is explained by the fact that at those probabilities the variance of the binomial distribution is 
smaller, so we need less samples to obtain good estimates. 



4 Prototype architecture 



What so far illustrated has been implemented within a distributed software architecture, capable to get 
the better out of the MRiP computational policy (refer to for a wide overview of this topic). It 
is actually made of two distinct modules: a graphical front-end and a remote simulation engine. The 
front-end part acts as a server and is in charge of drawing the computational graph relative to the loaded 
BlenX |[TT1l models. BlenX is a new programming language intended to develop executable biological 
models starting from the composition of the description of the molecules involved in the system. The 
prototype collects any information from a BlenX model and serializes them in a proprietary, xml-based, 



data format along with all the simulation information manually inputted by the user (see Figure 4(a) 



State A). Further information about the logical formula to be checked, the a and e values are required 
only whenever one wants to automatically calculate the number of replicated simulations (or "replicas") 
needed to reach the required confidence threshold. However, both in the case that the number of replicas 
is user-defined and that it is automatically computed, a random number generator is instantiated and used 



to make a stream of initial seeds, one for each simulation (see Figure 4(a) - State B). 

As soon as the simulation task is invoked by the user, a number of independent simulation engines 
is instantiated (see Figure |4(b)| - Activity A). Among them, one is entitled to be master. The master 



handles both the inter-process and the client-server communications. In the former case, it takes care 
of scattering and dispatching the initial seeds to the slave processes (and to itself) and of gathering the 
results (see Figure 4(b)| - Activities C). MP I is broadly used for these purposes. In the latter case, the 
master node is responsible for counting the computed YES and for its sending to the server (see Figure 
4(b)| - Activities C). This is accomplished by means of a socket-based interface. Hence, each process 



simulates independently (see Figure 4(a) - State C and Figure 4(b) - Activities C) and evaluates on-the- 
fiy a logical formula, giving a boolean answer. The summation of the positive answers is sent to the 
server, which recomputes the Wilson method and returns a new number of simulation replicas to be 
performed (see Figure [4(a)] - State D and Figure [4(b)] - Activity B). This loop halts only when no more 



replicas are required to be performed. 



4.1 Policies of communications 

The prototype initializes as many independent processes as the replicas number estimated by the Wilson 
procedure. In agreement with the SIMD (Single Instruction, Multiple Data) computational policy, they 
share the same code area, namely they execute the same instructions in a parallel fashion on different 
instances of the same input model. In particular, two important tasks are executed in parallel: the import 
of the input model and the actual simulation. The former task consists just in loading a BlenX model and 
in extracting some important information from it. This can happen essentially because all the processes 
address a chunk of shared filesystem area where the input model lies in. After that, every process starts 
its independent simulation according to the parameters entered by the user. 

Therefore, we set two synchronization points. One occurs before starting the simulation task, exactly 
when the starting random seeds are computed. Actually, this would not be strictly necessary because, 
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(a) Wilson-driven computational loop 



(b) Inter-process and client-server communications 



in principle, every process could generate its own starting random seed. But, to tackle the problem of 
avoiding correlation among the individual trajectories of random numbers EH, we empowered the mas- 
ter process (the process with the lowest MP I rank) to generate as many random numbers as the number 
of processes and to split and distribute it to each other process (through the MPI_Scatter procedure). 
Contrarily, a very critical synchronization point concerns the Wilson step. It occurs whenever all the 
processes end their simulation and must communicate their YES number as well as must write the sim- 
ulation traces to filesystem. Here, two kind of communications take place. One happens among the 
processes themselves. Anyone send the master its computed simulation trace along with a boolean an- 
swer to the formula (via the MPI_Gather procedure), both encapsulated in an ad-hoc serializable data 
structure. Thus, the master node prints the traces to file and computes the probability p, as described 
before. Finally, it sends back p to the server side via a private socket for the Wilson re-computation. 



5 Experiments and Performances 

We consider a stochastic model of (a part) of the regulatory network that controls the buddying yeast 
cell cycle [26] and we present some preliminary experiments. We adopt the following protocol: first we 
identify few BLTLc formulae characterising relevant aspects of the cell-cycle behaviour. We then run our 
statistical verification tool to estimate the probability of the considered formulae. To assess the accuracy 
of the statistical procedure we compare the estimates obtained through our statistical model checker with 
the exact values calculated through numerical model checking, namely by means of the PRISM model 
checker |[20l . As the state-space dimension corresponding to the original cell-cycle model is too large 
to be handled through numerical model checkers we consider a "scaled-down" version of the model for 
validating the statistical model checkin approach against the numerical one. 

Finally, for assessing the performances of the improved Wilson method we compare the number of 
simulations performed by running every experiment twice: once with the initial point estimate a = 1, 
and then, following the original "conservative" approach, with a = 0.5. 

The cell cycle is the a concatenation of biochemical and morphological events that lead to the repro- 
duction (duplication) of a cell. The "standard" model considers a loop of four phases: Gl, growth and 
preparation of the chromosomes for replication; S, synthesis and duplication of DNA; G2, synthesis of 



Ballarini, Forlin, Mazza & Prandi 



57 



R3ys ' 



(c) 



Rly 









^^^^ 










a: Cdc20 










V 






\ x: Cdk/CycB 
















L 









(d) 



Figure 4: Budding yeast cell-cycle cartoon and simulation 



significant protein for mitosis; M, mitosis, i.e., cell division. The cell cycle is regulated by a network 
of biochemical reactions centered around complexes of cyclin dependent kinases (Cdk's) and their regu- 
latory partners. Active complex Cdk/CycB induces cell cycle phase changes by activating or inhibiting 
target molecules lT24l|25l . 



The simplified model we consider is sketched in Figure 4(c) It consists of three species, x (Cdk/CycB 
complex), y (activated APC/Cdhl complex) and a (activated Cdc20), and nine molecular reactions listed 
in Tableland with parameters in Table [2] Complex x is synthesized and degraded by reactions R\ x and 
R2x, respectively. Complex y speeds up x production by means of reaction 7?3 V . At the same time, x 
deactivates y by R^ y . Also, y turns active by itself with R\ y and with the help of a in reaction /?2 V - Finally, 
a is produced and consumed by R\ a and R2 a and regulated by x in reaction /?2«- Such system behaves has 
a bistable switch with two stable states: Gl with low x and high y, and S/G2/M with high x and low y, as 



shown in the plot depicting a typical simulation outcome in Figure 4(d) 
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Table 1 : Budding yeast cell-cycle reactions 



Studying the role of Cdc20 in the SIG2IM transition. We target the experiments of this section to 
the study of the so-called SIG2IM transition which begins in states with low level of activated APC, 
high concentration of Cdk/CycB and (initially) low level of Cdc20. By looking at the topology of the 
network in Figure 4(c)| and at the form of the corresponding equations (Table [TJ, it is evident that a (i.e. 



Cdc20) plays a fundamental part in the activation of y hence in the controlling the SIG2IM transition. 
Specifically the progressive growth of a results in the (initially slow) activation of y which then, in turns, 
is responsible for the degradation of x. The influence of a on y can be studied through BLTLc formulae 
of the following type: 

<j>i = {a < i) U{y> j), h = {a< i) (y > j) 
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Component 


Rate Constant 


Dimensionless constants 


Cdk/CycB 


h = 0.04, k' 2 = 0.04, t% = 1, = 1 




APC/Cdhl 


k' 3 = 1, k*{ = 10, fc 4 = 2, £ 4 = 35 


7 3 =0.04,/ 4 = 0.04 




7 * /^mouvv / *3 cca yin 
K 3 - J 4 +^(av) k i - J 3 +{ayi„ 


y 5 = 0.3 




-tjf. k-$yin 

K 4 ~ J3+(ay,„) 


m = 0.80 


Cdc20 


k' 5 = 0.005, t% = 0.2, £ 6 = 0.1, h = 35 

t" I 
. * _ k_ /J5_ 

5 a / mai 


a = 0.00236012 



Table 2: Parameter Values: Cell Cycle toy model 



Formula 0i represents the possibility that y grows above threshold j while a does not exceed threshold 



i. Since y gets abruptly activated only after a has reached high concentration (see Figure 4(d) ) then, for 
i < j and 8 = j — i, we expect a low probability of 0i for large 8, and a higher probability of 0i for high 
i and small c0 

Figure[5]compares exact versus estimated probability measure for the time-bounded formula 02, ver- 
ified with respect to different time points (t G [0.2, 1.6] step 0.2). The (cross marked) point estimates 
(depicted in Figure [5] together with their confidence interval), have been calculated with 99.99% confi- 
dence and 0.005 interval semi-amplitude (e = 0.005). The exact values computed with PRISM (red plot 
in Figure [5]) fall within the confidence interval of each point estimates, confirming the accuracy of the 
statistical verification method we have realized. 

Further data regarding application of statistical verification to formulae <p\ and fa (and variants) 
are reported in Table [3] The calculated estimate (/>), confidence interval ([L,U]) corresponding to the 
chosen confidence level (conf) and interval width (e) are depicted together with the exact measure of 
probability p (calculated with PRISM). For each experiment we also calculate the sample size (AO as 
well as the average path length \o\. Results in Tableplindicate a good accuracy of the obtained statistics 
and confirm the performance gain (i.e. in terms of sample size) of the sequential algorithm of Figure [2] 
versus the conservative approach corresponding to a = 0.5. As argued before such gain is greater with a 
larger distance of the estimate from the median. 



Table 3: Estimated probability of cell-cycle properties vs numerical evaluation. 



formula 


P 


[L,U] 


conf 


e 


a 


P 


N 


M 


{a < 4) U (y > 5) 


0.10384 


[0.080518,0.12700] 


99 


0.025 


1 


0.10458 


1124 


84.74 


(a < 4) U (y > 5) 


0,09554 


[0,081823,0,11128] 


99 


0.025 


0.5 


0.10458 


2648 


84,61 


{a < 20) U (y > 35) 


0.00534 


[0.080518,0.12700] 


99 


0.025 


1 


0.00571 


187 


1274.65 


(a < 20) U (y > 35) 


0,00717 


[0,00401,0,01280] 


99 


0.025 


0.5 


0.00571 


2648 


1296,72 


(a < 40) U (40 < y < 42) 


0.62004 


[0.59472,0.64472] 


99 


0.025 


1 


0.6312 


2495 


4597.67 


(a < 40) U (40 < y < 42) 


0.63085 


[0.60064,0.65064 ] 


99 


0.025 


0.5 


0.6312 


2648 


4721,63 



2 (j>2 allows also to study the dependence on time of such an attitude. 
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Figure 5: Exact vs Estimated probability of the time-bounded Until formula: (a < 4)U^°-'\y > 5), esti- 
mates with 99.99% confidence and e = 0.005 semi-interval amplitude. 



6 Conclusions 

The stochastic modeling framework have been demonstrated very important in systems biology. Un- 
fortunately the complexity of living systems most often results in models which are so large that meth- 
ods based on numerical approaches such as, for example, transient/steady-state analysis and/or exact 
(stochastic) model checking are simply not feasible. For this reason, stochastic models of biological 
systems are usually studied by means of simulation-based approaches. In this paper we have presented a 
statistical model checking approach targeted to the verification of biological models. A novel temporal 
logic language, namely the BLTLc language, has been introduced: it allows to formally capture complex 
features of a biological system's dynamics. The methodology we proposed employes a statistical engine 
to estimate the probability of a BLTLc formula to hold true of a stochastic model M. Such estimates 
is obtained by on-the-fly verification of the considered BTLTc formula against simulated executions 
of the M. The resulting estimates is given by the frequency of the positive answers (i.e. the number of 
TRUEs resulting from the verification of against each simulation trace) out of the total of the simu- 
lated traces. The statistical engine we introduced is based on an efficient variant of the so-called Wilson 
score-interval method, which improves the performances, i.e. it requires a smaller number of trajectories 
in order to meet the chosen confidence level, with respect to more popular statistical engines such as 
those based on the so-called Wald-interval method. We have implemented the proposed algorithms for 
analysis and statistical testing as a prototype tool, which employs MPI technology to distribute verifi- 
cation engines. The current implementation exploits multi-core processors, in particular, our tests have 
been performed on a quad core machine Intel Q9300 CPU with 4G of RAM under Widows XP. A cluster 
and a GRID version of our tool is under development, that will maximize the parallelism of the proposed 
methodology. Moreover we are planning a complete suite of performance tests against tools with similar 
features, as those stated above. 

Related work. Techniques for the verification of temporal logic property against probabilistic/s- 
tochastic models, can be either exact or approximate. Exact approaches work by constructing a complete 
representation of a finite state space model and, because of this, their application to complex systems is 
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unfeasible. PRISM [20] and MRMC [ 18 ] are two popular probabilistic model checking tools that support 
both exact and approximated CSL verification. Approximated verification can be one of two different 
types. If the considered problem is to establishing whether the likelihood p of a formula is p < b where 
b G [0, 1] is a threshold and < {<, <, >. >} (i.e. model checking problem) then the outcome of verifi- 
cation is boolean and is determined based on Hypothesis testing. On the other hand if the problem is 
one of determining an estimate for p then this is achieved through confidence interval based techniques. 
PRISM approximated verification belong to the latter type: the size of the sample is determined stati- 
cally as a function of the chosen level of confidence and the desired approximation, rather than being 
calculated iteratively as function of intermediate estimates, as is the case with our method, whereas paths 
generation is controlled by on-the-fly checking of the considered formula. Furthermore although PRISM 
has been recently added with support for (exact) probabilistic LTL model checking, at the best of our 
knowledge, it currently supports statistical verification only for CSL (and not for LTL). The YMER 11331 
and MRMC tool, on other hand, features approximated (hypothesis testing based) model checking which 
uses on-the-fly verification of sampled path in order to decide whether the probability of formula is be- 
low/above a threshold. The Monte Carlo Model Checker MC2(PLTLc) [12] computes a point estimate 
of a Probabilistic LTL logic (with numerical constraints) formula to hold of model. MC2(PLTLc) does 
not include any simulation engine but works offline by taking a set of sampled trajectories generated by 
any simulation or ODE solver software. Besides MC2(PLTLc) calculates also the probabilistic domain 
of satisfaction for any free variable of PLTLc formula. Finally the APMC tool iTTTl features confidence 
interval based estimates of the probability of Probabilistic LTL and PCTL formulae to hold of either 
DTMC and CTMC models. 

Acknowledgments. The authors would like to thank Alida Palmisano for her valuable advises. 
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